Intro to DBMS
database: organized collection of inter-related data that models some aspect of the real world.
- db are the core component of most computer applications
- some common use such as we store our db as comma-separated value(csv) files.
database management system(DBMS): software that allows applications to store and analyze information in a database.
- a general-purpose DBMS supports the definition, creation, querying, update, and administration of databases in accordance with the data model.
- data model: a collection of concepts for describing the data in a database.
- some common data models are relational, key-value, graph, document, etc. This course will focus on relational model.
- NoSQL data model: Key-Value, Graph, Document/XML/Object, Wide-Column/Column-Family, etc. (named not only SQL, but generally do not use SQL)
- Array/Matrix/Vector also can be considered as a data model and widely used for machine learning.
- some other data models: hierarchical, network, Multi-value, Object-relational, etc. They are kind of obsolete/legacy/rarely used.
- schema: a description of a particular collection of data, using a given data model.
Why we should have a dbms? Imagine you store your data in flat files (e.g. csv):
- data integrity: no constraints on data(e.g. some people like to store date as MM/DD/YYYY, some like DD/MM/YYYY)
- efficiency: scan whole file to find data,
- concurrency: multiple users can access the same data at the same time, what if they update at same time?
- durability: if the system crashes, how to recover the data?
Relational Model
relational model*: a database abstraction based on relations to avoid maintenance overhead. Key tenets(principles) of the relational model:
- Store db in simple data structures(relations)
- Physical storage left up to the DBMS implementation
- Access data through high-lvl languages, let DBMS figures best strategy to execute
To implement relational model, we need to define:
- Structure: the definition of the db's realtions and their contents
- relation: an unordered set that contains the relationship of attributes that represent entities
- tuple: a set of attribute values in a relation, sometimes we call it attribute values' domain
- primary key of a relation: uniquely identifies a single tuple in the relation
- DBMS can auto-gen unique primary keys:
IDENTITY
in SQL Server,AUTO_INCREMENT
in MySQL,SEQUENCE
in PostgreSQL
- DBMS can auto-gen unique primary keys:
- foreign key: specifies that an attribute from one relation maps to a tuple in another relation. (e.g. student id in a course relation is a foreign key to the student relation)
- Integrity: the constraints of the db's contents
- Constraints: user-defined conditions that must hold for any instance of the database
- Manipulation: programming interface for accessing and modifying a db's contents
Relational Algebra
We need DML(data manipulation language) to manipulate data in a db. There are two types of DML:
- Procedural DML(i.e. the relational algebra): query that specifies high-level strategy to find the desired result based on sets/bags
- Non-procedural DML(i.e. relational calculus): query that specifies only what data is wanted and not how to find it.
The language is comprised of different classes of commands:
- Data Manipulation Language (DML): SELECT, INSERT, UPDATE, and DELETE statements.
- Data Definition Language (DDL): Schema definitions for tables, indexes, views, and other objects.
- Data Control Language (DCL): Security, access controls.
Relational Algebra Operators
The operators of relational algebra are based on set algebra and each operator takes one or more relations as its input and outputs a new relation. The operators are:
- (select): Choose a subset of the tuples from a relation that satisfies a selection predicate, ()
- predicate here acts as a filter to retain only tuples that satisfy the requirement
- predicate can use , , to combine multiple predicates
- SQL:
SELECT * FROM R WHERE predicate
- (projection): Generate a relation with tuples that contains only hte specified attributes, ()
- rearrange, remove, modify and etc. the attributes of a relation
- SQL:
SELECT attribute1, attribute2 FROM R
- (union): Generate a relation that contains all tuples that appear in either only one or both of the input relations, ()
- SQL:
(SELECT * FROM R) UNION (SELECT * FROM S)
- SQL:
- (intersection): Generate a relation that contains all tuples that appear in both of the input relations, ()
- SQL:
(SELECT * FROM R) INTERSECT (SELECT * FROM S)
- SQL:
- (difference): Generate a relation that contains only the tuples that appear in the first and not hte second of the input relations, ()
- SQL:
(SELECT * FROM R) EXCEPT (SELECT * FROM S)
- SQL:
- (product): Generate a relation that contains all combinations of tuples from the two input relations, ()
- SQL:
SELECT * FROM R, S
orSELECT * FROM R CROSS JOIN S
- SQL:
- (join): Generate a relation that contains all tuples that are a combination of two tuples (one from each input relation) with a common values for one or more attributes, ()
- SQL:
SELECT * FROM R JOIN S ON predicate
orSELECT * FROM R NATURAL JOIN S
orSELECT * FROM R JOIN S USING (attribute)
- SQL:
- (rename): Generate a relation that contains all tuples from the input relation but with a new name, ()
- SQL:
SELECT * FROM R AS new name
- SQL:
- (assignment): Assign the result of a relational algebra expression to a relation, ()
- SQL:
CREATE TABLE R AS expression
- SQL:
- (duplicate elimination): Generate a relation that contains all tuples from the input relation but with duplicate tuples removed, ()
- SQL:
SELECT DISTINCT * FROM R
- SQL:
- (aggregation): Generate a relation that contains all tuples from the input relation but with duplicate tuples removed, ()
- SQL:
SELECT attribute1, attribute2 FROM R GROUP BY attribute1, attribute2
- SQL:
- (sorting): Generate a relation that contains all tuples from the input relation but with duplicate tuples removed, ()
- SQL:
SELECT * FROM R ORDER BY attribute1, attribute2
- SQL:
- (division): Generate a relation that contains all tuples from the input relation but with duplicate tuples removed, ()
- SQL:
SELECT * FROM R WHERE NOT EXISTS (SELECT * FROM S EXCEPT SELECT * FROM R)
- SQL: